Setup Database

library(pacman)
p_load(dplyr, DBI)

# Create a database table in memory
farmers.db <- 
  dbConnect(RMariaDB::MariaDB(), 
            group = "/Users/nickjenkins/Documents/Workshops & Conferences/SQL for Data Scientists/SQL for Data Scientists support files/FarmersMarketDatabase.sql",
            username = "root",
            password = "Hockeydude25",
            host = "localhost",
            port = 3306)

Creating Machine Learning Datasets Using SQL

In this chapter, we’ll discuss the development of datasets for two types of algorithms: classification and time series models.

A binary classification model predicts whether a record belongs to one category or another. In order to make predictions, the model needs to be trained. Binary classifiers are a type of supervised learning model, which means thy are trained by passing example rows of data labeled with each of the possible outcomes into the algorithm, so it can detect patterns and identify characteristics that are more strongly associated with one result or the other.

A time series model preforms statistical operations on a series of measurements over time to forecast what the measurement might be at some point in time in the future.

Datasets for Time Series Models

The simplest time series forecast uses a single variable measured over specified time intervals to predict the value of that same variable at a future point in time. For example, a model to predict the high temperature in a location tomorrow could have a dataset with year’s worth of daily high temperatures measured at that location. The dataset would have one row per day, with one column for the date, and another for the daily high temperature measured. A time series algorithm could detect seasonal temperature patterns, long-term trends, and the most recent daily high temperatures to predict what the high temperature might be tomorrow.

Let’s create a dataset that allows us to plot a time series of farmer’s market sales per week.

SELECT
  MIN(cp.market_date) AS first_market_date_of_week,
  ROUND(SUM(cp.quantity * cp.cost_to_customer_per_qty), 2) AS weekly_sale
FROM farmers_market.customer_purchases AS cp
LEFT JOIN farmers_market.market_date_info AS md
  ON cp.market_date = md.market_date
GROUP BY md.market_year, md.market_week
ORDER BY md.market_year, md.market_week

R also has packages for exponential smoothing models, so we’ll use R to generate a sales forecast for the eight weeks beyond the last date in our dataset.

ts.data %>% 
  plot_time_series(date, weekly_sale)
Registered S3 method overwritten by 'htmlwidgets':
  method           from         
  print.htmlwidget tools:rstudio
LS0tCnRpdGxlOiAiQ2hhcHRlciAxMjogQ3JlYXRpbmcgTWFjaGluZSBMZWFybmluZyBEYXRhc2V0cyBVc2luZyBTUUwiCm91dHB1dDogaHRtbF9ub3RlYm9vawotLS0KCiMgU2V0dXAgRGF0YWJhc2UKCmBgYHtyIHNldHVwfQpsaWJyYXJ5KHBhY21hbikKcF9sb2FkKGRwbHlyLCBEQkkpCgojIENyZWF0ZSBhIGRhdGFiYXNlIHRhYmxlIGluIG1lbW9yeQpmYXJtZXJzLmRiIDwtIAogIGRiQ29ubmVjdChSTWFyaWFEQjo6TWFyaWFEQigpLCAKICAgICAgICAgICAgZ3JvdXAgPSAiL1VzZXJzL25pY2tqZW5raW5zL0RvY3VtZW50cy9Xb3Jrc2hvcHMgJiBDb25mZXJlbmNlcy9TUUwgZm9yIERhdGEgU2NpZW50aXN0cy9TUUwgZm9yIERhdGEgU2NpZW50aXN0cyBzdXBwb3J0IGZpbGVzL0Zhcm1lcnNNYXJrZXREYXRhYmFzZS5zcWwiLAogICAgICAgICAgICB1c2VybmFtZSA9ICJyb290IiwKICAgICAgICAgICAgcGFzc3dvcmQgPSAiSG9ja2V5ZHVkZTI1IiwKICAgICAgICAgICAgaG9zdCA9ICJsb2NhbGhvc3QiLAogICAgICAgICAgICBwb3J0ID0gMzMwNikKYGBgCgojIENyZWF0aW5nIE1hY2hpbmUgTGVhcm5pbmcgRGF0YXNldHMgVXNpbmcgU1FMCgpJbiB0aGlzIGNoYXB0ZXIsIHdlJ2xsIGRpc2N1c3MgdGhlIGRldmVsb3BtZW50IG9mIGRhdGFzZXRzIGZvciB0d28gdHlwZXMgb2YgYWxnb3JpdGhtczogY2xhc3NpZmljYXRpb24gYW5kIHRpbWUgc2VyaWVzIG1vZGVscy4KCkEgKmJpbmFyeSBjbGFzc2lmaWNhdGlvbiBtb2RlbCogcHJlZGljdHMgd2hldGhlciBhIHJlY29yZCBiZWxvbmdzIHRvIG9uZSBjYXRlZ29yeSBvciBhbm90aGVyLiBJbiBvcmRlciB0byBtYWtlIHByZWRpY3Rpb25zLCB0aGUgbW9kZWwgbmVlZHMgdG8gYmUgdHJhaW5lZC4gQmluYXJ5IGNsYXNzaWZpZXJzIGFyZSBhIHR5cGUgb2YgKnN1cGVydmlzZWQgbGVhcm5pbmcqIG1vZGVsLCB3aGljaCBtZWFucyB0aHkgYXJlIHRyYWluZWQgYnkgcGFzc2luZyBleGFtcGxlIHJvd3Mgb2YgZGF0YSBsYWJlbGVkIHdpdGggZWFjaCBvZiB0aGUgcG9zc2libGUgb3V0Y29tZXMgaW50byB0aGUgYWxnb3JpdGhtLCBzbyBpdCBjYW4gZGV0ZWN0IHBhdHRlcm5zIGFuZCBpZGVudGlmeSBjaGFyYWN0ZXJpc3RpY3MgdGhhdCBhcmUgbW9yZSBzdHJvbmdseSBhc3NvY2lhdGVkIHdpdGggb25lIHJlc3VsdCBvciB0aGUgb3RoZXIuCgpBICp0aW1lIHNlcmllcyBtb2RlbCogcHJlZm9ybXMgc3RhdGlzdGljYWwgb3BlcmF0aW9ucyBvbiBhIHNlcmllcyBvZiBtZWFzdXJlbWVudHMgb3ZlciB0aW1lIHRvIGZvcmVjYXN0IHdoYXQgdGhlIG1lYXN1cmVtZW50IG1pZ2h0IGJlIGF0IHNvbWUgcG9pbnQgaW4gdGltZSBpbiB0aGUgZnV0dXJlLgoKIyBEYXRhc2V0cyBmb3IgVGltZSBTZXJpZXMgTW9kZWxzCgpUaGUgc2ltcGxlc3QgdGltZSBzZXJpZXMgZm9yZWNhc3QgdXNlcyBhIHNpbmdsZSB2YXJpYWJsZSBtZWFzdXJlZCBvdmVyIHNwZWNpZmllZCB0aW1lIGludGVydmFscyB0byBwcmVkaWN0IHRoZSB2YWx1ZSBvZiB0aGF0IHNhbWUgdmFyaWFibGUgYXQgYSBmdXR1cmUgcG9pbnQgaW4gdGltZS4gRm9yIGV4YW1wbGUsIGEgbW9kZWwgdG8gcHJlZGljdCB0aGUgaGlnaCB0ZW1wZXJhdHVyZSBpbiBhIGxvY2F0aW9uIHRvbW9ycm93IGNvdWxkIGhhdmUgYSBkYXRhc2V0IHdpdGggeWVhcidzIHdvcnRoIG9mIGRhaWx5IGhpZ2ggdGVtcGVyYXR1cmVzIG1lYXN1cmVkIGF0IHRoYXQgbG9jYXRpb24uIFRoZSBkYXRhc2V0IHdvdWxkIGhhdmUgb25lIHJvdyBwZXIgZGF5LCB3aXRoIG9uZSBjb2x1bW4gZm9yIHRoZSBkYXRlLCBhbmQgYW5vdGhlciBmb3IgdGhlIGRhaWx5IGhpZ2ggdGVtcGVyYXR1cmUgbWVhc3VyZWQuIEEgdGltZSBzZXJpZXMgYWxnb3JpdGhtIGNvdWxkIGRldGVjdCBzZWFzb25hbCB0ZW1wZXJhdHVyZSBwYXR0ZXJucywgbG9uZy10ZXJtIHRyZW5kcywgYW5kIHRoZSBtb3N0IHJlY2VudCBkYWlseSBoaWdoIHRlbXBlcmF0dXJlcyB0byBwcmVkaWN0IHdoYXQgdGhlIGhpZ2ggdGVtcGVyYXR1cmUgbWlnaHQgYmUgdG9tb3Jyb3cuCgpMZXQncyBjcmVhdGUgYSBkYXRhc2V0IHRoYXQgYWxsb3dzIHVzIHRvIHBsb3QgYSB0aW1lIHNlcmllcyBvZiBmYXJtZXIncyBtYXJrZXQgc2FsZXMgcGVyIHdlZWsuCgpgYGB7c3FsIGNvbm5lY3Rpb24gPSBmYXJtZXJzLmRiLCBvdXRwdXQudmFyID0gInRzLmRhdGEifQpTRUxFQ1QKICBNSU4oY3AubWFya2V0X2RhdGUpIEFTIGZpcnN0X21hcmtldF9kYXRlX29mX3dlZWssCiAgUk9VTkQoU1VNKGNwLnF1YW50aXR5ICogY3AuY29zdF90b19jdXN0b21lcl9wZXJfcXR5KSwgMikgQVMgd2Vla2x5X3NhbGUKRlJPTSBmYXJtZXJzX21hcmtldC5jdXN0b21lcl9wdXJjaGFzZXMgQVMgY3AKTEVGVCBKT0lOIGZhcm1lcnNfbWFya2V0Lm1hcmtldF9kYXRlX2luZm8gQVMgbWQKICBPTiBjcC5tYXJrZXRfZGF0ZSA9IG1kLm1hcmtldF9kYXRlCkdST1VQIEJZIG1kLm1hcmtldF95ZWFyLCBtZC5tYXJrZXRfd2VlawpPUkRFUiBCWSBtZC5tYXJrZXRfeWVhciwgbWQubWFya2V0X3dlZWsKYGBgCgpSIGFsc28gaGFzIHBhY2thZ2VzIGZvciBleHBvbmVudGlhbCBzbW9vdGhpbmcgbW9kZWxzLCBzbyB3ZSdsbCB1c2UgUiB0byBnZW5lcmF0ZSBhIHNhbGVzIGZvcmVjYXN0IGZvciB0aGUgZWlnaHQgd2Vla3MgYmV5b25kIHRoZSBsYXN0IGRhdGUgaW4gb3VyIGRhdGFzZXQuCgpgYGB7cn0KcF9sb2FkKHRpZHl2ZXJzZSwgbW9kZWx0aW1lLCBsdWJyaWRhdGUsIHRpZHltb2RlbHMsIHhnYm9vc3QsIHRpbWV0aykKCiMgc2VyaWVzIHBsb3QKdHMuZGF0YSAlPiUgCiAgbXV0YXRlKGRhdGUgPSB5bWQoZmlyc3RfbWFya2V0X2RhdGVfb2Zfd2VlaykpICU+JSAKICBnZ3Bsb3QoYWVzKHggPSBkYXRlLCB5ID0gd2Vla2x5X3NhbGUpKSArCiAgZ2VvbV9saW5lKCkgKwogIHNjYWxlX3lfY29udGludW91cyhsaW1pdHMgPSBjKDAsIDE4MDApLCAKICAgICAgICAgICAgICAgICAgICAgYnJlYWtzID0gc2VxKGZyb20gPSAwLCB0byA9IDE4MDAsIGJ5ID0gMjAwKSkgKwogIHNjYWxlX3hfZGF0ZShkYXRlX2JyZWFrcyA9ICI0IHdlZWtzIikgKwogIHRoZW1lX21pbmltYWwoKSArCiAgdGhlbWUoYXhpcy50ZXh0LnggPSBlbGVtZW50X3RleHQoYW5nbGUgPSA5MCkpCgp0cy5kYXRhIDwtIAogIHRzLmRhdGEgJT4lIAogIG11dGF0ZShkYXRlID0geW1kKGZpcnN0X21hcmtldF9kYXRlX29mX3dlZWspKQoKdHMuZGF0YSAlPiUgCiAgcGxvdF90aW1lX3NlcmllcyhkYXRlLCB3ZWVrbHlfc2FsZSkKCnNwbGl0cyA8LSBpbml0aWFsX3RpbWVfc3BsaXQodHMuZGF0YSwgcHJvcCA9IDAuOSkKCiMgZm9yZWNhc3QKZXhwLm1vZGVsIDwtIAogIGV4cF9zbW9vdGhpbmcoKSAlPiUgCiAgc2V0X2VuZ2luZShlbmdpbmUgPSAiZXRzIikgJT4lIAogIGZpdCh3ZWVrbHlfc2FsZSB+IGRhdGUsIGRhdGEgPSB0cmFpbmluZyhzcGxpdHMpKQoKbW9kZWwudGJsIDwtIG1vZGVsdGltZV90YWJsZShleHAubW9kZWwpCgptb2RlbC50YmwgJT4lCiAgbW9kZWx0aW1lX2NhbGlicmF0ZShuZXdfZGF0YSA9IHRlc3Rpbmcoc3BsaXRzKSkgJT4lIAogIG1vZGVsdGltZV9mb3JlY2FzdChuZXdfZGF0YSA9IHRlc3Rpbmcoc3BsaXRzKSwKICAgICAgICAgICAgICAgICAgICAgYWN0dWFsX2RhdGEgPSB0cy5kYXRhKSAlPiUKICBwbG90X21vZGVsdGltZV9mb3JlY2FzdCguaW50ZXJhY3RpdmUgPSBGQUxTRSkKYGBgCg==